Breast Cancer Wisconsin Diagnostic
Introduction
Context
[…]
1. Information about the dataset
“Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.”
Figure: fine needle aspirate (FNA) of a breast mass
Ten real-valued features are computed for each cell nucleus:
| Name of the variables | type | Description |
|---|---|---|
| 1) ‘radius’ | num | distances from center to points on the perimeter |
| 2) ‘texture’ | num | standard deviation of gray-scale values |
| 3) ‘perimeter’ | num | perimeter of the nucleus |
| 4) ‘area’ | num | area of the nucleus |
| 5) ‘smoothness’ | num | local variation in radius lengths |
| 6) ‘compactness’ | num | \(perimeter^2 / area - 1.0\) |
| 7) ‘concavity’ | num | severity of concave portions of the contour |
| 8) ‘concave.points’ | num | number of concave portions of the contour |
| 9) ‘symmetry’ | num | symmetry of the nucleus |
| 10) ‘fractal_dimension’ | num | \(\text{coastline approximation} - 1\) |
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. All feature values are recorded with four significant digits.
The 3-dimensional space is that described in: [3].
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
2. Question of interest
The aim is to predict whether the cancer is benign or malignant using only a few features.
3. Choice of data
The dataset contains many variables; we keep only those concerning the mean. The same analysis was also carried out on the worst-type variables and led to similar conclusions. See the inspection of these two subframes in the annex.
4. Plan of the analysis
- Inspect the data to understand it better
- Check whether the sample proportions of benign and malignant cancers are representative of the whole population
- Select variables of interest by discarding variables according to their correlation level
- Build GLM models, select the best model with ANOVA, and discuss the goodness of this model
- Apply PCA to see to what extent the dimension can be reduced, and fit a GLM model on the relevant components
- Conclude by assessing the accuracy of these GLM models on a subset of the data
In the annex, I build a GLM model with the glmnet library and find the same selection of variables.
I. Preprocessing
1. Required libraries
# Check whether packages are installed; install them if needed, then load them
if (!require(pacman)) {
  install.packages(c("pacman", "remotes"))
}
if (!require(papaja)) {
  remotes::install_github("crsh/papaja")
}
if (!require(equatiomatic)) {
  install.packages("equatiomatic")
}
if (!require(devtools)) {
  install.packages("devtools")
  devtools::install_github("kassambara/factoextra")
}
pacman::p_load(pander, # Rmarkdown visualization
               GGally,
               ggfortify,
               ggplot2,
               MASS, # for stepAIC
               here,
               kableExtra, # Rmarkdown table visualization
               papaja,
               glmnet, # GLM implementation and analysis library
               equatiomatic, # Rmarkdown model equation visualization
               patchwork, # arrange subplots
               devtools, # tool for PCA
               factoextra) # tool for PCA
2. Import data
path <- here("LUCILE") # get the relative path
setwd(path)            # set the working directory
df <- read.csv("data.csv", stringsAsFactors = TRUE) # load the data

Let's delete the ID number and the last variable because they are not relevant.

df <- df[, -33]
df <- df[, -1]

We will work only with the mean-type data, so let's create a new frame for the variables of this type.
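The construction of `df_mean` is not shown in the chunk above; here is a minimal sketch, assuming the usual column layout where, after dropping `id`, the mean-type block is `diagnosis` followed by the ten `*_mean` columns (`make_mean_frame` is a hypothetical helper, not part of the original script):

```r
# Hypothetical helper: keep 'diagnosis' plus the ten mean-type columns and
# strip their "_mean" suffix so the variables read 'radius', 'texture', ...
make_mean_frame <- function(df) {
  out <- df[, 1:11]                         # diagnosis + ten mean features
  names(out) <- sub("_mean$", "", names(out))
  out
}
# df_mean <- make_mean_frame(df)
```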
3. Inspection
Structure
str(df_mean)
'data.frame': 569 obs. of 11 variables:
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
$ radius : num 18 20.6 19.7 11.4 20.3 ...
$ texture : num 10.4 17.8 21.2 20.4 14.3 ...
$ perimeter : num 122.8 132.9 130 77.6 135.1 ...
$ area : num 1001 1326 1203 386 1297 ...
$ smoothness : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ compactness : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ concavity : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ concave.points : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
$ symmetry : num 0.242 0.181 0.207 0.26 0.181 ...
$ fractal_dimension: num 0.0787 0.0567 0.06 0.0974 0.0588 ...
Head
pander(head(df_mean))

| diagnosis | radius | texture | perimeter | area | smoothness | compactness |
|---|---|---|---|---|---|---|
| M | 17.99 | 10.38 | 122.8 | 1001 | 0.1184 | 0.2776 |
| M | 20.57 | 17.77 | 132.9 | 1326 | 0.08474 | 0.07864 |
| M | 19.69 | 21.25 | 130 | 1203 | 0.1096 | 0.1599 |
| M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 |
| M | 20.29 | 14.34 | 135.1 | 1297 | 0.1003 | 0.1328 |
| M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 |
| concavity | concave.points | symmetry | fractal_dimension |
|---|---|---|---|
| 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 0.198 | 0.1043 | 0.1809 | 0.05883 |
| 0.1578 | 0.08089 | 0.2087 | 0.07613 |
Summary
pander(summary(df_mean))

| diagnosis | radius | texture | perimeter | area |
|---|---|---|---|---|
| B:357 | Min. : 6.981 | Min. : 9.71 | Min. : 43.79 | Min. : 143.5 |
| M:212 | 1st Qu.:11.700 | 1st Qu.:16.17 | 1st Qu.: 75.17 | 1st Qu.: 420.3 |
| NA | Median :13.370 | Median :18.84 | Median : 86.24 | Median : 551.1 |
| NA | Mean :14.127 | Mean :19.29 | Mean : 91.97 | Mean : 654.9 |
| NA | 3rd Qu.:15.780 | 3rd Qu.:21.80 | 3rd Qu.:104.10 | 3rd Qu.: 782.7 |
| NA | Max. :28.110 | Max. :39.28 | Max. :188.50 | Max. :2501.0 |
| smoothness | compactness | concavity | concave.points |
|---|---|---|---|
| Min. :0.05263 | Min. :0.01938 | Min. :0.00000 | Min. :0.00000 |
| 1st Qu.:0.08637 | 1st Qu.:0.06492 | 1st Qu.:0.02956 | 1st Qu.:0.02031 |
| Median :0.09587 | Median :0.09263 | Median :0.06154 | Median :0.03350 |
| Mean :0.09636 | Mean :0.10434 | Mean :0.08880 | Mean :0.04892 |
| 3rd Qu.:0.10530 | 3rd Qu.:0.13040 | 3rd Qu.:0.13070 | 3rd Qu.:0.07400 |
| Max. :0.16340 | Max. :0.34540 | Max. :0.42680 | Max. :0.20120 |
| symmetry | fractal_dimension |
|---|---|
| Min. :0.1060 | Min. :0.04996 |
| 1st Qu.:0.1619 | 1st Qu.:0.05770 |
| Median :0.1792 | Median :0.06154 |
| Mean :0.1812 | Mean :0.06280 |
| 3rd Qu.:0.1957 | 3rd Qu.:0.06612 |
| Max. :0.3040 | Max. :0.09744 |
4. Proportion of benign vs malignant cancer
kable(prop.table(table(df_mean$diagnosis)), col.names = c("Severity", "Frequency"))

| Severity | Frequency |
|---|---|
| B | 0.6274165 |
| M | 0.3725835 |
The two types of cancer are not present in equal proportions, which can introduce a bias.
However, this proportion is representative of reality:
“The benign to malignant ratio (B:M ratio) among breast biopsies (number of benign breast lesions divided by number of breast cancers) is widely believed to be around 4:1 or 5:1” [2]
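The representativeness check listed in the plan can be sketched with a one-sample proportion test; the hypothesized benign rate `p0` below is a placeholder value to be replaced by the rate implied by the literature:

```r
# One-sample test of the observed benign proportion (357 of 569) against a
# hypothesized population benign rate p0 (placeholder value).
p0 <- 0.63
res <- prop.test(x = 357, n = 569, p = p0)
res$p.value  # a large p-value is consistent with the hypothesized rate
```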
II. Selection of variables of interest
Hypothesis:
According to the description of the data, radius, perimeter, area and compactness should be correlated, since a formula links these variables.
1. Correlation
ggpairs
ggpairs(
df_mean,
aes(color = diagnosis, alpha = 0.5),
upper = list(continuous = wrap(
"cor", size = 3, alignPercent = 1
)),
axisLabels = "none",
legends = TRUE
) +
labs(title = " Breath cancer features scatterplot matrix") +
theme(panel.grid = element_blank(), axis.ticks = element_blank()) Description :
- On the lower triangular part of the matrix, scatterplots of each pair of feature for benign (red) and malignant (blue) type are displayed.
- On the diagonal, the variable distribution is displayed.
- On the upper triangular part Pearson correlation is displayed.
Observations :
- In each subplot a distinction can be made according to the type of ‘diagnosis’.
- The observations coming from malignant cancer seem to be in general bigger than the data coming from benign cancer.
ggcorr
The following function gives a better visualization of the correlations:
ggcorr(
df_mean,
geom = "text",
nbreaks = 5,
hjust = 1,
label = FALSE,
label_alpha = 0.7
) + labs(title = "Breast cancer features Pearson correlation") +
  theme(legend.position = "none")

2. Variables selection
As expected, radius, perimeter and area are highly correlated (\(r \approx 1\)).
Surprisingly, concavity, compactness and concave.points also have a strong correlation (\(r \approx 0.8\) or \(r \approx 0.9\)).
Even though compactness is defined as \(perimeter^2 / area - 1.0\), its correlation with area or perimeter is not 1, because the Pearson coefficient captures only linear dependence and their relation is not linear.
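That Pearson correlation only captures linear dependence can be illustrated with a deterministic but nonlinear relation:

```r
# y is a perfect function of x, yet the Pearson correlation is ~0 because the
# relation is symmetric around zero rather than linear.
x <- -10:10
y <- x^2
cor(x, y)  # ~0
```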
Variables discarded
The variables area, perimeter and compactness are well explained by the other variables (when each is regressed on the others, the adjusted R-squared is very close to 1), so we discard them.
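This redundancy screening can be sketched on a built-in dataset: regress one predictor on the others and inspect the adjusted R-squared (here `mtcars` stands in for the cancer data):

```r
# How well is 'disp' explained by the other predictors? An adjusted R-squared
# close to 1 would mark it as largely redundant.
fit <- lm(disp ~ cyl + hp + wt, data = mtcars)
summary(fit)$adj.r.squared
```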
df_mean_reduc <- df_mean[-c(4, 5, 7)]

III. GLM
1. Set a first GLM model
We fit a GLM with the remaining features, using the binomial family since the response ‘diagnosis’ has two levels: […]
m <-
glm(
data = df_mean_reduc,
diagnosis ~ radius + texture + smoothness + concavity + concave.points +
symmetry + fractal_dimension,
family = binomial
)
summary(m)
##
## Call:
## glm(formula = diagnosis ~ radius + texture + smoothness + concavity +
## concave.points + symmetry + fractal_dimension, family = binomial,
## data = df_mean_reduc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35180 -0.13938 -0.03229 0.02046 3.15368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.38387 6.66946 -4.256 2.08e-05 ***
## radius 0.88701 0.21852 4.059 4.93e-05 ***
## texture 0.37262 0.06212 5.998 2.00e-09 ***
## smoothness 78.50170 32.64920 2.404 0.0162 *
## concavity 15.52082 8.35462 1.858 0.0632 .
## concave.points 46.67203 26.16265 1.784 0.0744 .
## symmetry 16.85783 10.75613 1.567 0.1170
## fractal_dimension -101.54448 61.26233 -1.658 0.0974 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 153.35 on 561 degrees of freedom
## AIC: 169.35
##
## Number of Fisher Scoring iterations: 8
2. Model selection
By performing several ANOVA tests, we will see that we can remove the features concavity, symmetry and fractal_dimension, because there is no significant difference between the model with these variables and the one without.
fractal_dimension
m1 <-
glm(
data = df_mean_reduc,
diagnosis ~ radius + texture + smoothness + concavity + concave.points +
symmetry,
family = binomial
)
summary(m1)
##
## Call:
## glm(formula = diagnosis ~ radius + texture + smoothness + concavity +
## concave.points + symmetry, family = binomial, data = df_mean_reduc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31157 -0.14140 -0.03545 0.02138 3.12564
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.26429 5.83732 -5.870 4.36e-09 ***
## radius 1.02392 0.20523 4.989 6.07e-07 ***
## texture 0.37400 0.06241 5.992 2.07e-09 ***
## smoothness 60.88592 30.33864 2.007 0.0448 *
## concavity 7.64062 7.05960 1.082 0.2791
## concave.points 50.19961 26.15735 1.919 0.0550 .
## symmetry 15.58198 10.71215 1.455 0.1458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 156.25 on 562 degrees of freedom
## AIC: 170.25
##
## Number of Fisher Scoring iterations: 8
anova(m, m1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
## Model 2: diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 561 153.35
## 2 562 156.25 -1 -2.9022 0.08846 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The two models have a similar AIC.
- The ANOVA test shows no significant difference between the two models.
symmetry
m2 <-
glm(
data = df_mean_reduc,
diagnosis ~ radius + texture + smoothness + concavity + concave.points,
family = binomial
)
summary(m2)
##
## Call:
## glm(formula = diagnosis ~ radius + texture + smoothness + concavity +
## concave.points, family = binomial, data = df_mean_reduc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28927 -0.15267 -0.03761 0.02390 3.03440
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -32.18048 5.69324 -5.652 1.58e-08 ***
## radius 0.99766 0.20762 4.805 1.55e-06 ***
## texture 0.36496 0.06131 5.953 2.64e-09 ***
## smoothness 72.72278 30.46830 2.387 0.0170 *
## concavity 10.21913 6.99120 1.462 0.1438
## concave.points 48.66262 26.54605 1.833 0.0668 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 158.34 on 563 degrees of freedom
## AIC: 170.34
##
## Number of Fisher Scoring iterations: 8
anova(m, m2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
## Model 2: diagnosis ~ radius + texture + smoothness + concavity + concave.points
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 561 153.35
## 2 563 158.34 -2 -4.9887 0.08255 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The two models have a similar AIC.
- The ANOVA test shows no significant difference between the two models.
concavity
m3 <-
glm(
data = df_mean_reduc,
diagnosis ~ radius + texture + smoothness + concave.points,
family = binomial
)
summary(m3)
##
## Call:
## glm(formula = diagnosis ~ radius + texture + smoothness + concave.points,
## family = binomial, data = df_mean_reduc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.42132 -0.15010 -0.04247 0.02603 2.86598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.57552 4.81406 -5.936 2.92e-09 ***
## radius 0.85081 0.17112 4.972 6.63e-07 ***
## texture 0.35845 0.05985 5.990 2.10e-09 ***
## smoothness 52.26403 26.08496 2.004 0.0451 *
## concave.points 78.73692 16.59332 4.745 2.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 160.32 on 564 degrees of freedom
## AIC: 170.32
##
## Number of Fisher Scoring iterations: 8
anova(m, m3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
## Model 2: diagnosis ~ radius + texture + smoothness + concave.points
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 561 153.35
## 2 564 160.32 -3 -6.9717 0.0728 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The two models have a similar AIC.
- The ANOVA test shows no significant difference between the two models.
smoothness
m4 <-
glm(data = df_mean_reduc,
diagnosis ~ radius + texture + concave.points,
family = binomial)
summary(m4)
##
## Call:
## glm(formula = diagnosis ~ radius + texture + concave.points,
## family = binomial, data = df_mean_reduc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.36043 -0.15742 -0.04644 0.02774 2.82699
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.16474 2.51027 -8.431 < 2e-16 ***
## radius 0.65637 0.12532 5.238 1.63e-07 ***
## texture 0.32593 0.05529 5.895 3.75e-09 ***
## concave.points 101.16839 13.02057 7.770 7.86e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.44 on 568 degrees of freedom
## Residual deviance: 164.38 on 565 degrees of freedom
## AIC: 172.38
##
## Number of Fisher Scoring iterations: 8
anova(m, m4, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
## Model 2: diagnosis ~ radius + texture + concave.points
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 561 153.35
## 2 565 164.38 -4 -11.035 0.02618 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The model without smoothness has a clearly higher AIC than the full model.
- The ANOVA test shows a significant difference between the two models.
- We therefore cannot remove the feature smoothness from the model.
3. Validation
The selected model is the following:
m_selected <-
glm(
data = df_mean_reduc,
diagnosis ~ radius + texture + smoothness + concave.points,
family = binomial
)
pander(summary(m_selected))

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -28.58 | 4.814 | -5.936 | 2.923e-09 |
| radius | 0.8508 | 0.1711 | 4.972 | 6.629e-07 |
| texture | 0.3585 | 0.05985 | 5.99 | 2.104e-09 |
| smoothness | 52.26 | 26.08 | 2.004 | 0.04511 |
| concave.points | 78.74 | 16.59 | 4.745 | 2.084e-06 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 751.4 on 568 degrees of freedom |
| Residual deviance: | 160.3 on 564 degrees of freedom |
With the formula:
extract_eq(
m_selected,
use_coefs = TRUE,
wrap = TRUE,
terms_per_line = 2
)\[ \begin{aligned} \log\left[ \frac { \widehat{P( \operatorname{diagnosis} = \operatorname{M} )} }{ 1 - \widehat{P( \operatorname{diagnosis} = \operatorname{M} )} } \right] &= -28.58 + 0.85(\operatorname{radius})\ + \\ &\quad 0.36(\operatorname{texture}) + 52.26(\operatorname{smoothness})\ + \\ &\quad 78.74(\operatorname{concave.points}) \end{aligned} \]
The ratio of the residual deviance to its degrees of freedom is \(160.3/564 \approx 0.284\), while the dispersion parameter is fixed at 1. [we want these two values close…]
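This check can be reproduced generically; a sketch on simulated logistic data (not the cancer data):

```r
# Residual deviance over residual degrees of freedom, compared with the
# dispersion parameter (fixed at 1 for the binomial family).
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
deviance(fit) / df.residual(fit)
```

For ungrouped binary responses this ratio is only a rough guide to overdispersion, not a formal test.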
Diagnostic plots

autoplot(m_selected, 1:4)

[…]
See in the annex another, less conservative, GLM modelling with glmnet.
IV. PCA
1. Variability explained by each PC
Let’s apply the principal components analysis to our features.
p <- prcomp(df_mean_reduc[,-1], scale = TRUE)

Let's look at the percentage of variability explained by each principal component.
Graph
fviz_eig(
p,
addlabels = TRUE,
geom = c("bar", "line"),
barfill = "grey",
barcolor = "grey",
linecolor = "red",
ncp = 10
) +
labs(title = "Percentage of variability explained by each PC",
x = "Principal Components", y = "Percentage of variance")

The first four components explain more than 90% of the variance.
Table
pander(summary(p))

Loadings (rotation matrix):

| | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| radius | -0.3589 | 0.5034 | -0.2736 | 0.02538 | 0.02717 |
| texture | -0.1731 | 0.3435 | 0.9012 | -0.04728 | 0.1932 |
| smoothness | -0.3858 | -0.361 | -0.1145 | -0.2599 | 0.7786 |
| concavity | -0.505 | 0.09141 | -0.0368 | -0.1644 | -0.4256 |
| concave.points | -0.5086 | 0.1908 | -0.1671 | -0.1031 | -0.06186 |
| symmetry | -0.3616 | -0.3141 | 0.09945 | 0.8707 | -0.01891 |
| fractal_dimension | -0.219 | -0.5957 | 0.2463 | -0.3657 | -0.4127 |
| | PC6 | PC7 |
|---|---|---|
| radius | 0.6285 | -0.3828 |
| texture | 0.01509 | 0.005236 |
| smoothness | -0.09876 | -0.1553 |
| concavity | -0.6008 | -0.4076 |
| concave.points | -0.01339 | 0.8139 |
| symmetry | 0.04041 | -0.02235 |
| fractal_dimension | 0.4819 | -0.01644 |
Importance of components:

| | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| Standard deviation | 1.848 | 1.345 | 0.9083 | 0.695 | 0.5893 |
| Proportion of Variance | 0.488 | 0.2585 | 0.1178 | 0.06901 | 0.04962 |
| Cumulative Proportion | 0.488 | 0.7465 | 0.8644 | 0.9334 | 0.983 |
| | PC6 | PC7 |
|---|---|---|
| Standard deviation | 0.295 | 0.1784 |
| Proportion of Variance | 0.01243 | 0.00455 |
| Cumulative Proportion | 0.9954 | 1 |
The first four components explain more than 90% of the variance.
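Reading the "more than 90%" threshold off the table can be automated from the cumulative proportions; a sketch on the built-in `USArrests` data:

```r
# Smallest number of principal components whose cumulative proportion of
# variance reaches 90%.
pc <- prcomp(USArrests, scale. = TRUE)
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
which(cum_var >= 0.90)[1]
```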
2. Observations in PC plans
autoplot(p,x = 1,y = 2, data = df_mean_reduc,colour = "diagnosis",
loadings = TRUE,loadings.colour = "blue",loadings.label = TRUE)+
  theme_minimal()

By plotting the data in the plane of the first two principal components we see a clear separation between the benign and malignant types of cancer. This means that the location of a new observation in this plane should allow us to predict the severity of the cancer. Let's try this approach by fitting a GLM model on the first two PCs.
Note: We plotted the data in all possible pairs formed by the first four components; the clearest separation between the cancer types occurs in the PC1–PC2 plane.
3. GLM model with PCA
Implementation
df_mean_pca <- cbind(df_mean_reduc, p$x)
glm_pca <- glm(data = df_mean_pca, diagnosis ~ PC1 + PC2, family = binomial)

Validation
pander(summary(glm_pca))

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -0.6441 | 0.2034 | -3.166 | 0.001545 |
| PC1 | -3.155 | 0.3467 | -9.098 | 9.206e-20 |
| PC2 | 2.403 | 0.3032 | 7.924 | 2.294e-15 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 751.4 on 568 degrees of freedom |
| Residual deviance: | 162.5 on 566 degrees of freedom |
autoplot(glm_pca, 1:4) + theme_minimal()

[…]
Conclusion
1. Features allowing to predict the severity of the cancer?
[…]
2. Test accuracy of the prediction with these features
Annex
1. Correlation in the ‘standard deviation’ and ‘worst’ group
## 'data.frame': 569 obs. of 11 variables:
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension.: num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## 'data.frame': 569 obs. of 11 variables:
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius : num 25.4 25 23.6 14.9 22.5 ...
## $ texture : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area : num 2019 1956 1709 568 1575 ...
## $ smoothness : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension.: num 0.1189 0.089 0.0876 0.173 0.0768 ...
We get the same results as for the mean group.

2. stepAIC
m5 <- stepAIC(m, trace = TRUE)
## Start: AIC=169.35
## diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
##
## Df Deviance AIC
## <none> 153.35 169.35
## - symmetry 1 155.78 169.78
## - fractal_dimension 1 156.25 170.25
## - concave.points 1 156.63 170.63
## - concavity 1 156.64 170.64
## - smoothness 1 159.85 173.85
## - radius 1 173.68 187.68
## - texture 1 203.11 217.11
m5$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
##
## Final Model:
## diagnosis ~ radius + texture + smoothness + concavity + concave.points +
## symmetry + fractal_dimension
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 561 153.3485 169.3485
3. glmnet
Let’s separate the dataset into train and test set
## 75% of the sample size
prop_train_test <- floor(0.75 * nrow(df_mean_reduc))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(df_mean_reduc)), size = prop_train_test)
#sample = sample.split(data$num, SplitRatio = .75)
#train = subset(data, sample == TRUE)
#test = subset(data, sample == FALSE)
train <- df_mean_reduc[train_ind, ]
test <- df_mean_reduc[-train_ind, ]
x_train <- train[,-1]
y_train <- train$diagnosis
x_test <- test[,-1]
y_test <- test$diagnosis
tol_length <- length(levels(y_train))
# run glm : train and test with cross validation
cvfit <- cv.glmnet(as.matrix(x_train), y_train, family = "binomial", type.measure = "class")
plot(cvfit)
cvfit$lambda.min
## [1] 0.01890567
assess <- assess.glmnet(cvfit, newx = as.matrix(x_test), newy = y_test, s = 'lambda.min')
confusion.glmnet(cvfit, newx = as.matrix(x_test), newy = y_test, s = 'lambda.min')
## True
## Predicted B M Total
## B 79 11 90
## M 1 52 53
## Total 80 63 143
##
## Percent Correct: 0.9161
as.numeric(1 - assess$class)
## [1] 0.9160839
coef(cvfit, s = "lambda.min")
## 8 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -11.9549638
## radius 0.3363175
## texture 0.1750409
## smoothness .
## concavity .
## concave.points 63.6306102
## symmetry .
## fractal_dimension .
Let's also look at lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum:
cvfit$lambda.1se
## [1] 0.05773519
assess <- assess.glmnet(cvfit, newx = as.matrix(x_test), newy = y_test, s = 'lambda.1se')
confusion.glmnet(cvfit, newx = as.matrix(x_test), newy = y_test, s = 'lambda.1se')
## True
## Predicted B M Total
## B 80 13 93
## M 0 50 50
## Total 80 63 143
##
## Percent Correct: 0.9091
as.numeric(1 - assess$class)
## [1] 0.9090909
coef(cvfit, s = "lambda.1se")
## 8 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -6.91104517
## radius 0.20042878
## texture 0.07001007
## smoothness .
## concavity .
## concave.points 42.10066651
## symmetry .
## fractal_dimension .
4. knn
k-nearest neighbours? Not interpretable.
#install.packages('FNN')
#install.packages('crosstable')
library(crosstable)
library('FNN')
dat_pred <- FNN::knn(train = x_train, test = x_test, cl = y_train, k = 10)
pander(table(actual = y_test, predicted = dat_pred))

| actual \ predicted | B | M |
|---|---|---|
| B | 79 | 1 |
| M | 16 | 47 |
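For comparison with the glmnet accuracies above, the overall accuracy implied by this confusion table can be computed directly (the counts are copied from the table):

```r
# kNN confusion counts: rows = actual class (B, M), columns = predicted class.
tab <- matrix(c(79, 16, 1, 47), nrow = 2,
              dimnames = list(actual = c("B", "M"), predicted = c("B", "M")))
sum(diag(tab)) / sum(tab)  # overall accuracy, (79 + 47) / 143
```

At roughly 88%, this sits slightly below the ~91% obtained with glmnet above.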
Version of R used
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] FNN_1.1.3 crosstable_0.3.2 factoextra_1.0.7.999
## [4] patchwork_1.1.1 equatiomatic_0.3.0 glmnet_4.1-3
## [7] Matrix_1.3-4 kableExtra_1.3.4 here_1.0.1
## [10] MASS_7.3-54 ggfortify_0.4.12 GGally_2.1.2
## [13] ggplot2_3.3.5 pander_0.6.4 devtools_2.4.3
## [16] usethis_2.1.5 papaja_0.1.0.9997 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 ellipsis_0.3.2 rprojroot_2.0.2 flextable_0.6.10
## [5] base64enc_0.1-3 fs_1.5.2 rstudioapi_0.13 farver_2.1.0
## [9] remotes_2.4.2 ggrepel_0.9.1 fansi_0.5.0 xml2_1.3.3
## [13] codetools_0.2-18 splines_4.1.1 cachem_1.0.6 knitr_1.36
## [17] pkgload_1.2.4 jsonlite_1.7.2 shiny_1.7.1 compiler_4.1.1
## [21] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 fastmap_1.1.0
## [25] cli_3.1.0 later_1.3.0 htmltools_0.5.2 prettyunits_1.1.1
## [29] tools_4.1.1 gtable_0.3.0 glue_1.4.2 dplyr_1.0.7
## [33] Rcpp_1.0.7 jquerylib_0.1.4 vctrs_0.3.8 svglite_2.0.0
## [37] iterators_1.0.13 xfun_0.26 stringr_1.4.0 ps_1.6.0
## [41] testthat_3.1.1 rvest_1.0.2 mime_0.12 lifecycle_1.0.1
## [45] scales_1.1.1 promises_1.2.0.1 RColorBrewer_1.1-2 yaml_2.2.1
## [49] memoise_2.0.1 gridExtra_2.3 gdtools_0.2.3 sass_0.4.0
## [53] reshape_0.8.8 stringi_1.7.5 highr_0.9 desc_1.4.0
## [57] foreach_1.5.1 checkmate_2.0.0 pkgbuild_1.3.1 zip_2.2.0
## [61] shape_1.4.6 rlang_0.4.11 pkgconfig_2.0.3 systemfonts_1.0.3
## [65] evaluate_0.14 lattice_0.20-44 purrr_0.3.4 labeling_0.4.2
## [69] processx_3.5.2 tidyselect_1.1.1 plyr_1.8.6 magrittr_2.0.1
## [73] bookdown_0.24 R6_2.5.1 generics_0.1.0 DBI_1.1.2
## [77] pillar_1.6.3 withr_2.4.2 survival_3.2-11 tibble_3.1.5
## [81] crayon_1.4.1 uuid_1.0-3 utf8_1.2.2 rmarkdown_2.11
## [85] officer_0.4.1 grid_4.1.1 data.table_1.14.2 callr_3.7.0
## [89] rmdformats_1.0.3 forcats_0.5.1 digest_0.6.28 webshot_0.5.2
## [93] xtable_1.8-4 tidyr_1.1.4 httpuv_1.6.4 munsell_0.5.0
## [97] viridisLite_0.4.0 bslib_0.3.1 sessioninfo_1.2.2
References